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Given a labelled tree T, our goal is to group repeating 
subtrees of T into equivalence classes with respect 
to their topologies and the node labels. We present 
an explicit, simple and time-optimal algorithm for 
solving this problem for unrooted unordered labelled 
trees and show that the running time of our method 
is linear with respect to the size of T. By unordered, 
we mean that the order of the adjacent nodes 
(children /neighbours) of any node of T is irrelevant. 
An unrooted tree T does not have a node that is 
designated as root and can also be referred to as 
an undirected tree. We show how the presented 
algorithm can easily be modified to operate on trees 
that do not satisfy some or any of the aforementioned 
assumptions on the tree structure; for instance, how it 
can be applied to rooted, ordered or unlabelled trees. 

1. Introduction 

Tree data structures are among the most common 
and well studied of all combinatorial structures. They 
are present in a wide range of applications, such 
as in the implementation of functional programming 
languages [1], term-rewriting systems [2], programming 
environments [3], code optimization in compiler design 
[4], code selection [5], theorem proving [6] and 
computational biology [7]. 

Thus, efficiently extracting the repeating patterns in 
a tree structure represents an important computational 
problem. Recently, Christou et al. [8] presented a linear- 
time algorithm for computing all subtree repeats in 
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rooted ordered unlabelled trees. Christou et al. [9] extended this algorithm to compute all subtree 
repeats in rooted ordered labelled trees in linear time and space. The authors considered only full 
subtrees, i.e. subtrees which contain all nodes and edges that can be reached from their root. 

The limitation of the aforementioned results is that they cannot be applied to unrooted or 
unordered trees. By unrooted, we mean that the input tree does not have a dedicated root node; 
and, by unordered, we mean that the order of the adjacent nodes (children/neighbours) of 
any node of the tree is irrelevant. Such trees are a generalization of rooted ordered trees, and, 
hence, they arise naturally in a broader range of real-world applications. For instance, unrooted 
unordered trees are used in the field of (molecular) phylogenetics [10,11]. 

Biological motivation. The field of molecular phylogenetics deals with inferring the evolutionary 
relationships among species using molecular sequencing technologies and statistical methods. 
Phylogenetic inference methods typically return unrooted unordered labelled trees that represent 
the evolutionary history of the organisms under study. These trees depict evolutionary 
relationships among the molecular sequences of extant organisms (living organisms) that are 
located at the tips (leaves) of those trees and hypothetical common ancestors at the inner nodes 
of the tree. With the advent of the so-called next-generation sequencing technologies, large-scale 
multi-national sequencing projects such as 1KITE 1 (1000 insect transcriptome sequencing project) 
emerge. In these projects, large phylogenies that comprise thousands of species and massive 
amounts of whole-transcriptome or even whole-genome molecular data need to be reconstructed. 

Provided there is a fixed multiple sequence alignment (MSA) of the sequences — representing 
species — under study, the goal of phylogenetic inference is to find the tree topology that best 
explains the underlying data, using a biologically reasonable optimality criterion — a scoring 
function for the trees. One such optimality criterion is maximum likelihood (ML) [12]. Finding the 
optimal tree under ML is known to be NP-hard [13]. Note that the number of possible unrooted 
tree topologies for n species, located at the tips, grows super-exponentially with n. Therefore, 
widely used tools for ML-based inference of phylogenies, such as RAxML [14] and PHYML [15], 
rely on heuristic search strategies for exploring the immense tree space. 

The likelihood of each candidate tree topology T is calculated by computing the conditional 
likelihoods at each inner node of T. The conditional likelihoods are computed independently for 
each site (column in the MSA). They are computed via a post-order traversal of T starting from a 
virtual root. Note that, as long as the statistical model of evolution is time reversible (i.e. evolution 
occurred in the same way if followed forwards or backwards in time), the likelihood score is 
invariant with respect to where in T the virtual root has been placed. 

For a node k with child nodes i and j, we compute the conditional likelihoods at k for each 
possible state (e.g. A, C, G, T for DNA data) as follows [12]: 



where L K S * is the likelihood of observing the DNA nucleotide state for the subtree rooted at 

k. The function Ps k Si(bi) gives the probability that base evolved into base Sj after time b\ (the 

(k) 

branch length from i to k). If k is a tip (leaf) and consists of a nucleotide, say A, then Lr A := 1.0 and 

(k) (k) (k) 

L c := L G := L T := 0.0. Finally, we compute the overall likelihood for a single site at the virtual 
root of the tree by multiplying the prior probabilities tt x (also called base frequencies) of observing 
a nucleotide state x with the likelihood of that state at the virtual root r 



Once the likelihood for each site has been computed, the overall likelihood of the tree is 
the product over these per-site likelihoods. The tree topology is then modified using some tree 
alteration mechanisms (e.g. the RAxML search algorithm [14]), and the likelihood is computed 
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again for the new tree. After a number of trees has been evaluated (usually millions), the tree 
with the best likelihood is returned. 

In phylogenetic inference software, a common technique for optimizing the likelihood 
function, which typically consumes ^95% of total execution time, is to eliminate duplicate sites 
(equivalent columns in the MSA). This is achieved by compressing identical sites into site patterns 
and assigning them a corresponding weight. This can be done because duplicate sites yield exactly 
the same likelihood iff they evolve under the same statistical model of evolution. When two 
sites are identical, this means that the leaves of the tree are labelled equally. Consider a forest 
of trees with the same topology, where, for each tree, the labels are defined by the molecular data 
stored at a particular site of the MSA and the position of the tips. Knowing equivalent subtrees 
within such a forest would allow someone to minimize the number of operations required to 
compute the likelihood of a phylogenetic tree. This can be seen as a generalization of the site 
compression technique. 

Our contribution. In this article, we extend the series of results presented in [8,9] by introducing 
an algorithm that computes all subtree repeats in unrooted unordered labelled trees in linear time and 
space. The importance of our contribution is underlined by the fact that the presented algorithm 
can be easily modified to work on trees that do not satisfy some or any of the above assumptions 
on the tree structure, e.g. it can be applied to rooted, ordered or unlabelled trees. A preliminary 
version of this article appeared in [16]. 



2. Preliminaries 



(a) Basic definitions 

An unrooted unordered tree is an undirected unordered acyclic-connected graph T = (V,E), 
where V is the set of nodes and E is the set of edges such that E c V x V with \E\ = \ V\ — 1. The 
number of nodes of a tree T is denoted by \T\ := \ V\. An alphabet U is a finite, non-empty set 
whose elements are called symbols. A string over an alphabet U is a finite, possibly empty, string 
of symbols of U. The length of a string x is denoted by \x\, and the concatenation of two strings x 
and y by xy. A tree T is labelled if every node of T is labelled by a symbol from some alphabet E. 
Different nodes may have the same label. 

A tree centre of an unrooted tree T = (V, E) is the set of all vertices such that the greatest node 
distance to any leaf is minimal. An unrooted tree T has either one node that is a tree centre, in 
which case it is called a central tree, or two adjacent nodes that are tree centres, in which case it is 
called a bicentral tree [17]. Let T(T) = (V, A) be the rooted tree on V = V U {r}, where A is defined 
such that \A\ is minimal with (u, v) eA only if [u, v] e E and each node other than r is reachable 
from one central point. If T is a bicentral tree, we add the additional root node r to V and add two 
edges to A, namely (r, v) and (r, u), where v and u are the central points of T — the edge between 
its two central points is not added. Otherwise, if T is a central tree, with tree centre u, we set r := u 
and thus V=V. 

We call u e V a child of v iff (v, u) e A. In this case, we call v the parent of u and define 
parent(u) := v. We call u and u' siblings iff there exists a node v e V such that (v,u), (v,u r ) eA. 
Note that under this definition two central points of a bicentral tree are siblings of each other. 

The (rooted) subtree that contains node v as its root node, obtained by removing edge {v,u}, 
is denoted by T(v,u). We consider only full subtrees, i.e. subtrees which contain all nodes and 
edges that can be reached from i; when only the edge {v, u] is removed from the tree. The special 
case T(v, v) denotes the tree containing all nodes that is rooted in v. For simplicity, we refer to 
T(v,parent(v)) as T(v). The diameter of an unrooted tree T is denoted by d(T) and is defined as the 
number of edges of the longest path between any two leaves (nodes with degree 1) of T. The height 
of a rooted (sub)tree T(v, u) of some tree T, denoted by h(v, u), is defined as the number of edges on 
the longest path from the root v to some leaf of T(v, u). The height of a node v, denoted by h(v), is 
defined as the length of the longest path from v to some leaf in T(T). 



(a) (b) 




Figure 1. (a) An unrooted tree T consisting of 10 nodes; a non-overlapping subtree repeat R = {(3, 2), (4, 1)} is marked with 
dashed rounded rectangles; another non-overlapping subtree repeat containing the trees f(1, 2), f(2, 1) is marked with dashed 
rectangles, (b) An overlapping subtree repeat R = {{2, 3), (1, 4)} of T resulting from the deletion of the dashed edge and its 
corresponding dotted subtree. This is an overlapping subtree repeat since nodes 1 and 2 — and the node labelled by c — are in 
both subtrees. A total repeat R = {(1, 1), (2, 2)} of T can be obtained by keeping all the edges and rooting T in node 1 (f (1, 1)) 
and node 2 (f (2, 2)), respectively. 



For simplicity, in the rest of the text, we denote: a rooted unordered labelled tree by f, an 
unrooted unordered labelled tree by T and the rooted (directed) version of T by T(T), as defined 
above. 

(b) Subtree repeats 

Two trees T\ = (Vi,Ai) and T2 = (V 2/ A 2 ) are equal, denoted by T\ = T 2 , if there exists a bijective 
mapping/ : V\ -> V 2 such that the following two properties hold: 

(vi, v 2 ) e A 1 & (f(vi), f(v 2 )) e A 2 
label(u) = label(/(u)), Vu € V\ . 

A subtree repeat R in a tree T is a set of node tuples {u\, v\ ),..., (w^, v\r\), such that T(ui, v\) = • • • = 
T(u\r\,v\r\). We call \R\ the repetition frequency of R. If \R\ = 1 we say that the subtree T{u\,v\) 
does not repeat. An overlapping subtree repeat is a subtree repeat R, where at least one node v is 
contained in all |R| trees. If no such v exists, we call it a non-overlapping subtree repeat. A total 
repeat R is a subtree repeat that contains all nodes in T, that is, R = {{u\, u\), . . . , (u\r\,u\r\)}. See 
figure 1 in this regard. 

In the following, we consider the problem of computing all such subtree repeats of an unrooted 
tree T. 

3. Algorithm 

The algorithm works in two stages: the forward /non-overlapping stage and the backward/ 
overlapping stage. The forward stage finds all non-overlapping subtree repeats of some tree T. 
The backward stage uses the identifiers assigned during the forward stage to detect all 
overlapping subtree repeats, including total repeats. 

(a) The forward/non-overlapping stage 

We initially present a brief description of the algorithmic steps. Thereafter, we provide a formal 
description of each step in algorithm 1. This algorithm resembles the one in [18] for deciding tree 
isomorphism. 

In the following, we identify each node in the tree by a unique integer in the range of 1 to |T|. 
Such a unique integer labelling can be obtained, for instance, by a pre- or post-order tree traversal. 



The basic idea of the algorithm can be explained by the following steps: 



(i) Partition nodes by height. 

(ii) Assign a unique identifier to each label in E. 

(iii) For each height level starting from 0 (the leaves). 

(1) For each node v of the current height level construct a string containing the identifier 
of the label of v and the identifiers of the subtrees that are attached to v. 

(2) For each such string, sort the identifiers within the string. 

(3) Lexicographically sort the strings (for the current height level). 

(4) Find non-overlapping subtree repeats as identical adjacent strings in the 
lexicographically sorted sequence of strings. 

(5) Assign unique identifiers to each set of repeating subtrees (equivalence class). 

We will explain each step by referring to the corresponding lines in algorithm 1. 

Partitioning the nodes according to their height requires time linear with respect to the size 
of the tree and is described in line 2 of algorithm 1. This is done using an array H of queues, 
where H[i], for all 0 < i < \d{T)/2\ r contains all nodes of height i. Thereafter, we assign a unique 
identifier to each label in U in lines 3-7. The main loop of the algorithm starts at line 8 and 
processes the nodes at each height level starting bottom-up from the leaves towards the central 
points. The main loop consists of four steps. First, a string is constructed for each node v which 
comprises the identifier for the label at v followed by the identifiers assigned to u\, Ui, . . . , u Cv . The 
identifiers of u\, U2,..., u Cv represent the subtrees f{u\) r T{u2), . . . , T(u Cv ), where u\, u-i, . . . , u Cv are 
the children of v (lines 11-16). Assume that this particular step constructs k strings s\,S2, • • • ,s^. 

In the next step, we sort the identifiers within each string. To obtain this sorting in linear time, 
we first need to remap individual identifiers contained as letters in those strings to the range 
[l,m]. Here, m is the number of unique identifiers in the strings constructed for this particular 
height, and the following property holds: m < Y^=i \ s i\- We then apply a bucket sort to these 
remapped identifiers and reconstruct the ordered strings r\, ri, . . . , ffc (lines 17-20). 

The next step for the current height level is to find the subtree repeats as identical strings. 
To achieve this, we lexicographically sort the ordered strings r\,ri, . . .,rjt (line 22), and check 
neighbouring strings for equivalence (lines 23-33). For each equivalence class 1Z{ we choose a 
new, unique identifier that is assigned to the root nodes of all the subtrees in that class (lines 26 
and 33). Finally, each set IZi contains exactly the tuples of those nodes that are the roots of a 
particular non-overlapping subtree repeat of T and their respective parents. 

Remapping from T>\ — [1, \T\ + \Z\) to T>2 = [1, \H[i]\ + J2veH[i] can ^ e done using an array 
A of size \T\ + a counter m and a queue Q. We read the numbers of the strings one by one. 
If a number x from domain V\ is read for the first time, we increase the counter m by one, set 
A[x] := m, and place m in Q. Subsequently, we replace x by m in the string. In case a number x 
has already been read, that is, A[x] ^ 0, we replace x by A[x] in the string. When the remapping 
step is completed, only the altered positions in array A will be cleaned up, by traversing the 
elements of Q. 

Theorem 3.1 (Correctness). Given an unrooted tree T, algorithm 1 correctly computes all non- 
overlapping subtree repeats. 

Proof. First note that if any two subtrees T\ and T2 are repeats of each other, they must, by 
definition, be of the same height. So the algorithm is correct in only comparing trees of the 
same height. Additionally, non-overlapping subtree repeats of a tree T can only be of height 
ld(T) /2J or less, where d(T) is the diameter of T. Therefore, the algorithm is correct in stopping 
after processing all L^(T)/2J + 1 height classes, in order to extract all the non-overlapping subtree 
repeats. As the algorithm only extracts non-overlapping repeats, we define repeats to mean non- 
overlapping repeats for the rest of this proof. In addition, for simplicity, we consider the rooted 
version of T for the rest of this proof. 



Algorithm 1. FORWARD STAGE 



Input : Unrooted tree T = (V,E) labelled from Z 
Output: Sets 1Z re ps of non-overlapping subtree repeats of T 

1 > Partition tree nodes by height 

2 for allveV do Compute h(v) and ENQUEUE(H[/i(u)], v) 

3 cnt <- 0 

4 > Assign a number from 1 to | U | to each label 

5 for all labels i e U do 

6 cnf ^cnt + \ 

7 L[£] <- cnf 

8 > Compute subtree repeats 

9 reps «— 0 

10 for i +- 0 to L^(T)/2J do 



b 
• ^3 



11 
12 
13 
14 
15 
16 

17 
18 
19 
20 
21 

22 

23 

24 
25 
26 
27 
28 
29 

30 
31 

32 
33 



> Construct a string of numbers for each node v and its children 
foreach v e H[i] do 

Let children(v) = { u | {u, v] e E } \ {parent(v)} and c v — \children(v)\ 
s v L[label(v)]K[ui]K[u2] . . . K[u Cv ], if children(v) = {u\,ui, . . . , u c J 
S^SU{s v ] 

> Remap numbers [1, \T\ + to [1, |H[z]| + J2veH[i] c v] 
R +- Remap(S) 

> Bucket sort strings 

Bucket sort the (unique) numbers of all strings in R. 

Let R r be the set of individually sorted strings that have been extracted from the 
respective sorted list from the previous step. 

Lexicographically sort the strings in R f using radix sort and obtain a sorted list R" of 
strings ri,r 2 , . . . ,t\r»\. 



Let each r\ be of the form k\kj . . 
of the form L[v\]K[v[] . . . K[v\ ri] ] 
reps +- reps + 1 
Kreps *~ {(v\,parent(v\))} 
K[v\] reps + cnt 
for j ' 2 to k do 
if rj =Tj-\ then 

7Z>reps *~ 
else 

reps <- reps + 1 

K-reps *~ {(v{,parent(v[))} 
Klv^] reps + cnt 



. k\ r , and the corresponding, original unsorted string s z - 



■ Kreps U {(v , 1/ parent(v[))} 



We show that the algorithm correctly computes all repeats for a tree of any height by induction. 
For the base case, we consider an arbitrary tree of height 1 (trees with height 0 are trivial). Any 
tree of height 1 only has the root node and any number of leaves attached to it. At the root we 
can never find a subtree repeat, so we need only to consider the next lower (height) level, that 
is, the leaf nodes. Any two leaves with identical labels will, by construction of the algorithm, be 
assigned the same identifiers and thus be correctly recognized as repeats of each other. 



Now, assume that all (sub)trees of height m — 1 have correctly been assigned with identifiers 
by the algorithm and that they are identical for two (sub)trees iff they are unordered repeats of 
each other. 

Consider an arbitrary tree of height m + 1. The number of repeats for the tree spanned from 
the root (node r) is always one (the whole tree). Now consider the subtrees of height m. The root 
of any subtree of height m must be a child of r. For any child of r that induces a tree of height 
smaller than m, all repeats have already been correctly calculated according to our assumption. 

Two (sub)trees are repeats of each other iff the two roots have the same label and there is a 
one-to-one mapping from subtrees induced by children of the root of one tree to topologically 
equivalent subtrees induced by children of the root of the second tree. By the induction 
hypothesis, all such topologically equivalent subtrees of height m — 1 or smaller have already 
been assigned identifiers that are unique for each equivalence class. Thus, deciding whether two 
subtrees are repeats of each other can be done by comparing the root labels and the corresponding 
identifiers of their children, which is exactly the process described in the algorithm. The approach 
used in the algorithm correctly identifies identically labelled strings since the order of identifiers 
has been sorted for a given height class. Thus, the algorithm finds all repeats of height m (and 
m + 1 at the root). ■ 

Theorem 3.2 (Complexity). Algorithm 1 runs in time and space 0(\T\). 

Proof. We prove the linearity of the algorithm by analysing each of the steps in the outline of 
the algorithm. Steps (i) and (ii) are trivial and can be computed in \T\ and \E\ steps, respectively. 
Note that |27| < |T|. 

The main for loop visits each node of T once. For each node v a string s v is constructed which 
contains the identifier of the label of v and the identifiers assigned to the child nodes of v. Thus, 
each node is visited at most twice: once as parent and once as child. This leads to In — 1 node 
traversals, where n is the number of nodes of T, since the root node is the only node that is visited 
exactly once. The constructed strings for a height level i are composed of the nodes in H[i] and 
their respective children. In total, we have c(i) := J2veH[i\ c v child nodes at a height level i, where 
c v is the number of children of node v. Therefore, the total size of all constructed strings for a 
particular height level i is \H[i] \ + c(i). Step (iii)(2) runs in linear time with respect to the number 
of nodes at each height level i and their children. This is because the remapping is computed 
in linear time with respect to \H[i]\ + c{i). By the remapping, we ensure that the identifiers in 
each string are within the range of 1 to \H[i] \ +c{i). Using bucket sort, we can then sort the 
remapped identifiers in time \H[i] \ + c(i) for each height level i. Consequently, the identifiers 
in each string can be sorted in time \H[i] \ + c(i) by traversing the sorted list of identifiers and 
positioning the respective identifier in the corresponding string on a first-read-first-place basis. 
This requires additional space \H[i] \ + c(i) to keep track of which remapped identifier corresponds 
to which strings. 

After remapping and sorting the strings, finding identical strings as repeats requires 
a lexicographical sorting of the strings. Strings that are identical form classes of repeats. 
Lexicographical sorting (using radix sort) requires time (D(|H[z]| + c(z)) and at most space for 
storing \T\ + \U\ elements since the identifiers are in the range of 1 to \T\ + This memory 
space needs to be allocated only once. Moreover, the elements that have been used are 
cleared /cleaned-up at each step via a queue as explained for the remapping function. 

By summing over all height levels, we obtain ^}f^^(|H[z]| + c(z)) = In — 1. Thus, the total 
time over all height levels for each step described in the loop is 0(\T\). The overall time and space 
complexity of the algorithm are thus G{ \ T | ) . ■ 

We conclude this section with an example demonstrating algorithm 1. Consider the 
tree T from figure 2. The superscript indices denote the number associated with each 
node, which, in this particular example, correspond to a pre-order traversal of T(T) by 
designating node 1 as the root. Lines 1 and 2 partition the nodes of T in \d(T)/2\ + 1 
sets according to their height. The sets H[0] = {3, 5, 6, 7, 8, 10, 11, 13, 14, 15, 17, 19, 20, 23, 25, 26, 28}, 




H[l] = {4,12,18,22,24,27}, H[2] = {2,9,21} and H[3] = {1 / 16} are created. Lines 5-7 create a 
mapping between labels and numbers. L[a] = 1, L[fr] = 2, L[c] = 3 and L[d] = 4. Table 1 shows the 
state of lists S, R, R f , R ff during the computation of the main loop of algorithm 1 for each height 
level, where S is the list of string identifiers, R is the list of remapped identifiers, R f is the list of 
individually sorted remapped identifiers and R ff is the list R f lexicographically sorted. Figure 3 
depicts tree T with the respective identifiers for each node as assigned by algorithm 1. 



(b) The backward/overlapping stage 

Definition 3.3 (Sibling repeat). Given an unrooted tree T, two equal subtrees of T(T) whose 
roots have the same parent are called a sibling repeat. 

Definition 3.4 (Child repeat — recursively defined). Given an unrooted tree T, two subtrees of 
T(T) whose roots have the same identifiers and whose root's respective parents are roots of trees 
in the same sibling or child repeat are called a child repeat. 



Table 1. State of lists 5, /?, R' , R" for each height level and resulting sets H reps of non-overlapping subtree repeats. 



step 


process 


repeats 


strings: 5 


2,1,3,2,4,4,2,2,3,11,2,3,4,2,3,4 


ft, = {3,7,11,13,19,25} 


remapping:/? 


1,2,3,1,4,4,1,1,3,2,2,1,3,4,1,3,4 


K 2 = {5,15,17} 


sorting: /? r 


1,2,3,1,4,4,1,1,3,2,2,1,3,4,1,3,4 


TZ 3 = {6,14,20,26} 


repeats: R" 


1,1,1,1,1,1, 2,2,2,3,3,3,3,4,4,4,4 

5 6 7 8 


K 4 = {8,10,23,28} 


strings: 5 


3675,3576,257,18,257,1 8 




remapping:/? 


1234,1432,543,67,543,67 


7^7 = {22, 27} 


sorting: /? r 


1234,1234,345,67,345,67 


lis = {4,12} 


repeats: R" 


1234,1234,345,345,67,67 

9 10 11 


Tl6 = {18,24} 


strings: 5 


1598,1859,1111011 


ft 8 = {2,9} 


remapping:/? 


1234,1423,1565 


= {21} 


sorting: /? r 


1234,1234,1556 




repeats: R" 


1234,1234,1556 

12 13 




strings: 5 


261013,11212 


TZm = 116} 


remapping:/? 


1234,566 


Ha = {1} 


sorting: /? r 


1234,566 




repeats: R" 


r234,^ 





14 15 



Note that with these definitions we get that two trees with roots u and v, respectively, are child 
or sibling repeats of each other iff the unique path between nodes u and v is symmetrical with 
respect to the node identifiers of the nodes traversed on the path. 

The two following lemmas illustrate why it is necessary and sufficient to know the identifiers 
from the forward stage to compute all overlapping subtree repeats. 

Lemma 3.5 (Sufficient conditions). Let r be the parent ofu and v, where u and v are roots of a sibling 
repeat. Then the trees T(u, u) and T(v, v) are elements of the same total repeat. The trees T(r, u) and T(r, v) 
are elements of the same overlapping subtree repeat. 

Let u and v be roots of a child repeat. Furthermore, let r u and r v be the parents ofu and v, respectively. 
Then the trees T(u, u) and T(v, v) are elements of the same total repeat, and the trees T(r U/ u) and T(r v , v) 
are elements of the same overlapping subtree repeat. 

Proof. Trivial, by inspection; see figure 2. ■ 

In figure 2, the trees T(2, 1) and T(9, 1) form a sibling repeat, thus the trees T(4,2) and T(12,9) 
form a child repeat. From the sibling repeat, we get that T(2, 2) and T(9, 9) are elements of a total 
repeat, while T(l, 2) and T(l, 9) are within the same overlapping repeat. Analogously, for the child 




Figure 4. T(v 2l v k ) = T{u 2l u k ) is an overlapping repeat iff T{u kl u 2 ) = T{v kl v 2 ) is a child repeat, which is true iff 

identifier(u k ) = identifier(v k ) f identifier(u 2 ) = identifier(v 2 ) f identifier^) = identifier^). 



repeat we get the trees T(4,4) and T(12, 12) as total repeats and {(2,4), (9, 12)} as an overlapping 
repeat. 

Note that lemma 3.5 implies that all nodes of a subtree that is an element of an overlapping 
subtree repeat with repetition frequency \R\ are roots of trees in overlapping repeat classes of 
frequency at least |R|. 

Lemma 3.6 (Necessary conditions). Any two trees that are elements of a total repeat must have been 
assigned the same identifiers at their respective roots during the forward stage, and be rooted in roots of 
either sibling or child repeats. 

Any two trees that are elements of an overlapping subtree repeat, but not of a total repeat, must have 
been assigned the same identifiers at their respective roots during the forward stage, and be rooted in parents 
of roots of either sibling or child repeats. 

Proof. We first look at the case of total repeats. Let T(u, u) = T(v, v). We now consider the unique 
path p between u and v. Obviously, for equality among these two trees to hold, the path must be 
symmetrical, which by recursion implies that u and v are roots of either sibling or child repeats 
(figure 4). 

The case of other overlapping subtree repeats works just the same. Let T(r u ,u) = T(r v , v) not 
be total, but an overlapping subtree repeat. These trees are obtained by removing a single edge 
from the tree: {r u , u] and {r v , v], respectively. Let p be the path between u and v. Since the trees are 
elements of an overlapping subtree repeat, r u and r v must lie on this path. Additionally, since r-u 
and r v are on the path from u to v, h(v) = h(u), and since any tree is acyclic, then r u and r v must 
be closer to the central points than u and v, respectively. Since there is an edge connecting r u with 
u and r v with v this means that r u and r v are parents of u and v, respectively. Again, the path p is 
symmetrical with respect to the node labels of nodes along the path, so u and v are roots of either 
sibling or child repeats. ■ 

Given these two lemmas, we can compute all overlapping subtree repeats by checking for 
sibling and child repeats. This can be done by comparing the identifiers assigned to nodes in the 
forward stage. The actual procedure of computing all overlapping subtree repeats is described in 
algorithm 2. Algorithm 2 takes as input an unrooted tree T that has been processed by algorithm 1, 
i.e. each node of tree T has already been assigned an identifier according to its non-overlapping 
repeat class. 



First, the algorithm considers the rooted version of T, that is, T(T). This is done since many 
operations and definitions rely on T(T). Next, we define a queue Q, whose elements are sets 
of nodes. Initially, Q contains only the set containing the root node of T(T) (line 2). Processing 
Q is done by dequeuing a single set of nodes at a time (lines 5-16). For a given set U of Q, 
the algorithm creates a set I containing the identifiers of children of all the nodes in U. Then, the 
algorithm remaps these identifiers to the range of [1, \I\] constructing a new set V (line 12). Then, 
we construct a list C of tuples, such that each tuple contains the remapped identifier of a child and 
the corresponding node. Therefore, we can use bucket sort to sort these tuples by the remapped 
identifiers in time linear in the cardinality of I. 

We are now in a position to apply lemmas 3.5 and 3.6. By lemma 3.6, finding sibling and child 
repeats is done by creating sets of nodes with equivalent identifiers in C (line 18). This can be 
easily done because of the sorting part of the algorithm. These sets are then enqueued in Q, and, 
by lemmas 3.5 and 3.6, all resulting subtree repeats (overlapping and total) are, thus, created 
(lines 21-22). Hence we immediately obtain the following result. 



Algorithm 2. BACKWARD STAGE 



Input : Unrooted tree T = (V,E) labelled from E with identifiers assigned by algorithm 1 
Output: Sets TZ'reps of overlapping subtree repeats of T 

1 > Initialize queue Q with the root node r of T(T) 

2 Enqueue(Q, {r}) 

3 > Compute overlapping subtree repeats 

4 while Queue-Not-Empty(Q) do 
U <r- Dequeue(Q) 

> Get the identifiers of the children of the nodes in U 
Let cod(U) be the cumulated out degree of all the nodes in U 
Let children(U) = {u\, ui,..., u co d(u)} be the children of the nodes in U 
Let ids(children(U)) = i<i, . . . , i C od(U)} be the identifiers of {u\, ui, . . . , w c0 d(iz)} 
I <— ids(children(U)) 

> Remap numbers [1, |T| + \E\) to [1, \I\] 
I f +- Remap(J) 

Let V = {iy i f 2 , . . . , i f co d(U)} ^ e * ne rema PP e d identifiers of {u\, ui, . . . , u co ^u)} 
Let C =< i v ui >, < i f 2 , u 2 >,..., < i f C0 d(uy u cod(U) > be a list of tuples 

> Bucket sort the remapped identifiers 
Bucket sort the list C by i f v i' 2 , . . . , i f C0 ^ U y 

> Extract the equivalence classes 

foreach E = {di, V2, . . . , v^} of nodes with equivalent identifiers in C do 
Enqueue(Q,E) 
for i <- 1 to k do 

K'reps *~ K>' revs U {{parent{Vi), Vj)} 
H'reps+1 fc'reps+l U {(^'/ ^)} 

reps reps + 2 



Theorem 3.7 (Correctness). Given an unrooted tree T with identifiers assigned by algorithm 1, 
algorithm 2 correctly computes all overlapping subtree repeats, including total repeats. 

Algorithm 2 enqueues each node of T once. For each enqueued node, a constant number of 
operations is performed. Therefore, we get the following result. 

Theorem 3.8 (Complexity). Algorithm 2 runs in time and space 0(\T\). 



4. Properties 
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In this section, we provide properties which could potentially be used to speed up an 
implementation of the above algorithms. 

Property 4.1 (Trivial path). If we find a non-overlapping subtree with repetition frequency 1, no node 
that lies on the path from the root of that subtree to the furthest central point {including the central point 
itself) can have a repetition frequency other than lfor non-overlapping subtree repeats rooted in this node. 
We call this path the trivial path. 

Proof The proof is trivial. Assume that some node v on the trivial path would induce a non- 
overlapping subtree repeat with frequency higher than 1. By definition, all subtrees of the subtree 
rooted in v must be contained in all subtrees in this repeat. In particular, the original subtree with 
repetition frequency 1. ■ 

The implications for implementations are obvious. Any time we encounter a subtree with 
repetition frequency 1, we can mark all nodes on the trivial path as trivial, and add them to their 
own repeat class. 

Property 4.2 (Inclusion of trivial path). All trees from overlapping subtree repeats with repetition 
frequency higher than 1 must contain all nodes that lie on any trivial path. 

Proof. The proof is trivial. We prove the property by contradiction. Let v be a node on the trivial 
path. Then, by construction of overlapping subtree repeats only a single subtree in the repeat 
contains v. However, since v is on the trivial path, there must be a subtree without repeats induced 
by it. That is, no other subtree in the same overlapping repeat can have this subtree included, 
which contradicts the equality among trees. ■ 



5. Conclusion 

We presented a simple and time-optimal algorithm for computing all full subtree repeats in 
unrooted unordered labelled trees, and showed that the running time of our method is linear 
with respect to the size of the input tree. 

The presented algorithm can easily be modified to operate on trees that do not satisfy some or 
any of the aforementioned assumptions on the tree structure. 

— Rooted trees: in a rooted tree T, only non-overlapping repeats can occur. Therefore, it is 
sufficient to apply algorithm 1 with the following modifications: first, we define T(T) := T; 
second, the main for loop must iterate over the height of T, instead of depending on its 
diameter. 

— Ordered trees: if for a node the order of its adjacent nodes is relevant, i.e. the tree is ordered, 
the bucket sort procedures in algorithms 1 and 2 must be omitted. Additionally, sibling 
repeats must not be merged in line 19 of algorithm 2 but rather be enqueued separately. 

— Unlabelled trees: trivially, an unlabelled tree can be seen as a labelled tree with a single 
uniform symbol assigned to all nodes. 

Algorithm 1 can also be used to compute subtree repeats over a forest of rooted unordered trees. 
The method is the same as for the case of a single tree. The method reports all subtree repeats by 
clustering the identifiers of equal subtrees from all trees in the forest into an equivalence class. 
The correctness of this approach can be trivially obtained by connecting the roots of all trees 
in the forest with a virtual root node, and applying the algorithm to this single tree. This solves 
the problem involved in the concrete application scenario discussed in §1. 

Algorithm 1 can also be directly applied to solve the maximal leaf-agreement isomorphic 
descendant subtree (MLAIDS) problem [19]. MLAIDS is defined as follows: given a set of k 
phylogenetic (evolutionary) trees, find k maximal subtrees from the given trees, such that the 
leaves as well as the structure of the subtrees are equal. 
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